Okay. For the noise-sensitive table project, we decided to take the noise as input and give feedback with LEDs on the table. This is one of the possibilities that were offered to us, but we decided to do it this way. Yes, the first idea of the project was to have the whole table covered by LEDs. But for the prototype this is not the way we implement it now. In fact, yes, there was the restriction that if we cover the table with LEDs, some won't be visible because you have, yes. So if there is information displayed under the sheets, it won't be very useful. So we decided to have it, yes, thirty centimetres at each side. Weeks. This is for practical reasons and also a matter of time. Because Yes. But there are other problems, like, yes, the information displayed on the sheets. So we don't know if it's Yes. True. And there are a lot of problems now about this implementation. It's not obvious. Hmm? Still in progress. Almost done. Mm-hmm. Yes. Mm yes. Yes. I have one or two questions. I don't know if you want to begin with the explanation about the release. Okay. So I actually have two questions that are just practical questions about my implementation now. No. So the first question is about the noise level. Not the noise level, the sound level. Because what you get from the microphones is a curve. And the way it is implemented now, to detect, yes, I have packets, let's say, this is not fixed, but I have packets of about half a second. So it's second. And on these packets I want to know the sound level, the average sound level.
So the way I do it now is to take each of the curve's tops and bottoms and just make the average of this. But I don't know if it's, it's not really the best way to do it. But I don't know if Yes. Okay. So this is the first question. Yes. But I think these are two different problems. The filtering must be done before. And after that, if you have a cleaned curve, you can just apply the function. And the second question was about filtering and FFT, yeah, what you can get from it. General questions. Okay. Mm-hmm. Yes, I will just add something. Is there a way, from one stream, to build two streams that correspond to the voice of one person and the voice of the second? Yes. No. And this is not a sound question. Sound-related question. Line in. Yes. The communication between the extension and is of the. No, no, no. Okay. And the FireWire port. And there is an extension for this device connected to it with optic fibre. Yes. Yes. Because for now I'm just sampling the sound at eight thousand. So just something very, I could even go lower. But this is just to get the noise, the sound level. But if I want to record the conversation with a processing after, Yes, to have a better sound quality. Perhaps it's better to have an X_ array, yeah. Something like that. Yes. Okay. Mm-hmm. Sure, sure. But what is the gain? In terms of position? Centimetres or more? Okay. What you get is the real position, or just the angle at which the sound comes? Mm-hmm. Mm-hmm. Okay. Yes, sure. Yes. Because there are just four microphones that are eight. Okay, yes. Okay, but they are in a plane. Okay. Because at TPFN there is a lab that has done almost the same thing.
But they have eight microphones that are at, yes, the eight corners of a cube, yes. The elevation, no. But what could be relevant, it's not always the case with the, Yes, the position, the radius of the, Yes, it could be. Yes, okay. Yes. Because the distance between the microphones is too low to Mm-hmm. Yes, sure. Yes, but I think the most relevant part is the other, the hardware that synchronizes the microphones and that. Yes. Yes. Mm, not maximum. The average on the packets. Yes, when I go up the curve, I'm at the local maximum. I take it. Then the next local minimum, and taking this range, doing the average on it. Yes. Difference. The difference is that if I just Mm yes. Yes. Okay. You have fifteen implementations. Yes. Okay. Sixteen. Sure. Okay, okay. Yes, and then, because we are using an open source framework to do this, there is something, I don't know if you know it, called Sphinx, that is doing, sorry. Yes. Doing voice recognition. And I just use it for the microphone and the acquisition. So there is already an FFT module implemented in it. I didn't test it. But Yes, sure. Yes, for the position. Also for detection. Mm-hmm. Mm-hmm. 'Kay, yes. Sure. Okay. Yes. No. Yes. Okay. You have to detect an event on one microphone and know when it happened on the other one. No? Okay. Mm-hmm. Mm-hmm. It's related to the number of microphones around. No. Okay. Mm-hmm. Okay, this is a measure of the activity, okay. There is a distance. But it's not very effective. As you mentioned. It's And from this kind of information you can, let's say, have a fingerprint of the user to know if this is the same user speaking.
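The packet-level measurement described in this turn (take each local maximum, then the next local minimum, and average those ranges) can be compared with a plain root-mean-square level. The following is only a minimal sketch under stated assumptions: the function names, the half-second packet length and the 100 Hz test tone are illustrative and do not come from the project code.

```python
import math

def rms_level(packet):
    """Root-mean-square amplitude of one packet of samples."""
    return math.sqrt(sum(s * s for s in packet) / len(packet))

def peak_trough_level(packet):
    """Mean range between each local extremum and the next one."""
    extrema = [
        packet[i]
        for i in range(1, len(packet) - 1)
        # a sign change in the slope marks a local maximum or minimum
        if (packet[i] - packet[i - 1]) * (packet[i + 1] - packet[i]) < 0
    ]
    if len(extrema) < 2:
        return 0.0
    ranges = [abs(a - b) for a, b in zip(extrema, extrema[1:])]
    return sum(ranges) / len(ranges)

# Hypothetical packet: half a second of a 100 Hz tone sampled at 8 kHz.
RATE = 8000
packet = [math.sin(2 * math.pi * 100 * n / RATE) for n in range(RATE // 2)]

print(rms_level(packet))          # close to 0.707 for a unit-amplitude sine
print(peak_trough_level(packet))  # close to 2.0, the peak-to-trough range
```

On a clean tone both numbers track the amplitude. On real speech the peak-to-trough average is likely to react to every small ripple in the waveform, which is one reason the filtering question raised here matters before either measure is applied.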
So Not sure. Because the the um Okay. Which range.. Yes, okay. So This is this is a bit arbitrary. Because uh perhaps uh person two will have uh some frequencies in the range of Okay. And if you reconstruct the stream from this uh this range, you will uh get something uh that is uh Okay. Mm-hmm. And you can you can um this way you can filter uh perhaps some range that are uh just very low frequencies for uh perhaps uh a noise or The yes, the fan. Okay. Yes, but but then this this is very interesting because uh this way you can detect uh if someone is talking, if someone uh has an uh some noise with uh perhaps putting his uh his cap on the on the table. Perhaps if there is a laptop and uh in one position And you can classif uh classify. And this is Yes. She' very good. Okay. Yes. Yes. And the the length of the time frame has um an inference of uh on uh which parameters Okay. Th there i uh let's say if you use a one second window. Uh the in fact the the frame lengths will be uh very uh very uh i Yes. Okay. They they won't be visible on the on the wavefo okay. Yes. But the the the the long uh the longest frame you would use would be one hundred. At most, okay. Okay, yes.. 'Kay, so you can yeah. Okay. Enter uh when user is Uh yes. And before I forg forget it uh I have one question. Uh you say you do the processing after uh afterwards. And uh what is the the cost, yes, the the length? One one or uh bit longer? Okay. Yes, but Matlab is very uh very effective for uh matrices and uh Okay. Yes. Okay. Yes. Relative, yes. Mm. Yes. And even if you use uh four microphones to get uh not good. Okay. Okay. And eight is uh is good. There i yes. But uh Yeah. Yes, one, two degree is very very small for uh for our application. Because we want we really want to detect uh if th this is the same person that is speaking or not. If there is uh an error of uh let's say uh ten, fifteen degrees and but uh it is um capable of uh seeing it as two different person. 
There is not problems with the Okay, and the number of sectors is uh is not dependent of uh the number of microphones? Okay. If there is uh two two person the same sector, I see. Mm-hmm. Mm-hmm. No. Geometry is not. We have not uh thought a lot about it. Yes. F_F_T_. But th uh uh I think what is very important here is the synchronization of the microphones. Right. Okay, um uh I'm speaking about it because uh we wanted uh to do this table in uh a mid-tech way, let's say a cheap way. To have not uh very expensive tables. So I don't know if um uh what I've seen w on our website was uh that the this installation was uh a bit expensive. Uh but I'm trying to uh to to to see what is very important and uh what is not uh for Yes. Sure. Yes, sh I think you have to uh to uh It sh These are these are um some kinds of uh and this one. Yes. Okay. Expensive. Because because uh of the quality of the the And they are plug in the on the X_L_R_? Or uh Yes,. Okay, for the our we'll we we'll uh. Mm-hmm. Let's have a break.
No. I don't uh Well I guess it's just a restriction we took. So so we don't uh the m amount of money we have to spend. Otherwise we could have uh we could imagine to have a whole table of of uh s well, of L_E_D_s. And then y you can really start to to some regions of the table. So yeah. Oh let's. Does Yeah, I can just explain my project. It's completely different from this one. Well, it's also about tables, but the idea is if uh uh the several students come to a table with their laptops, they should be able to to share information easily. And uh in fact w what we we are doing now, we project um another screen on on the table, and different people are are are able to use their their mouse of their laptop to go to the to the project screen there and uh the project screen is just another Windows machine that is running. So you just have another f computer more or less you can use, but you can use it with your your own mouse and keyboard. Exactly. But uh in contrary to to having al another keyboard and mouse that only one person can use, now the different persons can use the the screen um at the same time with the with the mouse. I'll s still the same subjects. It's it's collaborative work. But uh It has nothing to do with uh audio processing and uh L_E_D_. So Anyway, uh I think the the details of the implementation is not that important. It's more important that first we want to to have some events like two people are talking or one one started and another started and uh well, more or less broke into the speech of the first person, and so on. And then you can start to visualise these uh different situations. Well that's uh I guess it's first prototype i you see we start with uh people two people talk or and don't and don't they. Uh afterwards uh one could imagine that you for example you can detect that someone is a bit angry or is uh yelling. And wi with different colours with uh with different colours on on table you could uh display this this uh give some feedback. 
But uh Yep. That's uh Once we get the hardware. Mm. Yes. Everything from um multi switch administration of uh audio and how you can uh analyse it, uh what information you can get out of speech and so on. So Probably it's uh it's f be a good idea if you have s two uh with um with source like uh books or web page centre. Give a lot of information how it is audio programming and so on works, cetera. Maybe you you know some at least a good book that's uh dealing with this stuff. Mm-hmm. And is possible to do this in real time. That's uh No. Yeah, but what happens if o someone is standing then? Yeah. No, no, no. In fact uh that's uh it's another thing, this optic. Well uh m we have Ye Actually, it's not really an extension to the first box, just an indu industry standard uh uh protocol between these two uh Well well there's a special protocol to to deliver uh audio signals and uh its path through uh o um Well actually, just to have the the ICSLA e uh possibly to to branch ICSLA. Hmm. No. Uh-huh. Well uh so you say this one already works to uh to tell the the direction. Um would it also work if you placed uh the microphones at at the border of the table? You see, now you you have it quite central. And uh what happens if if you put them here on around the table? Would it work as well? Or You just didn't try? Or Mm-hmm. Yeah. Yeah, it is clear, this. No. Mm-hmm. No. You lose information. Um. Yeah. Well that's something with F_F_T_s. And the phase. So the magnitude you don't really use? Or and for this uh Well uh uh uh I did a speech processing uh course and I'm not sure if they a uh they also um mm learns that that. We more used the face than the magnitude. So uh I guess the Uh-huh. Mm-hmm. And what happen uh if two person are talking at the same time? This it just get two sectors that are higher than others, no? Sounds. Well uh uh Mm-hmm. Yeah. No, it's more than. No, it's uh Mm-hmm. I'll show you this afterwards. 
Now if you don't really detect the pitch, ch can say that uh no persons.. It's not really meaningful. Well if you do this. Somehow. Mm-hmm. Mm thanks. Well, otherwise it could also just rely on the frequency band separation and detect that way the different persons and th then you don't really know where around the table these. But uh Well d depends with how you do it. But if if you place the the microphones uh n yeah. Uh That way you can separate where where is the the voice loud with uh or the loudest. Well it's quite different thing and uh you don't really get uh a name for it. But uh At the end it it depends a bit what you want. If you are I guess it's completely other approach. So uh Expensive? Ca can you uh say a number? Okay..
Okay. Should I start? Maybe by presenting the project a little bit. Okay, so historically, last year Prof Delambeaux, in his CSCW classes, asked the students to design a table. And all the students, and Guillaume was a student of this class, were very excited to design such a table. So he decided to do more research on this topic. So the general topic is about interactive tables, the disappearing computer, what is the other one, in the general field of computer-supported collaborative learning. So the idea is we have different people collaborating for a pedagogical purpose. And we know that when we put a computer in front of them, it doesn't help focusing on collaboration. So if we make the computing functionalities disappear into the furniture, into the wall or other things like that, we may more efficiently support collaboration and the learning. So I arrived in June, something like that, a couple of months ago. Guillaume and Michael are two students who are working four months on their master project. And they started working on two different sub-projects. So Guillaume is working on the noise-sensitive table. It's what we are discussing today. So the idea is, okay, we have in the lab a notion about group mirroring. The idea is, if we can provide feedback to people working together, it may help them self-regulate their activity, their collaboration. So it could be normative or not. In our own case we want to have a table that would be accessible to students. At EPFL the learning centre will be built in a couple of years. So the idea is we will have plenty of rooms, like in libraries or meeting rooms, where students will arrive and work.
So if we have this table that is accessible to students when they arrive, if we provide feedback on who is talking, instantly, at one moment, or through time, or the turn taking, maybe, just by seeing an explicit feedback of that, they will take profit of it and Yes. The idea is to have peripheral information. We don't want people to focus on what's going on, on the feedback. So we don't want a histogram or very precise information. But we will have LEDs that will be distant. There will be six centimetres between each LED, with a blurring glass. The idea is to have more a kind of artistic feedback, to have a kind of general feeling of what's going on. Something about dance. For practical reasons. We kept one paper sheet. So the shape of the table obviously is one of the factors that will change the collaboration. But we don't want to start with too many parameters. So we start with a table that is approximately that of the table where we are now. And we will have a square shape like this that will be here and a second one that would be there. So we will have something like this. Like this. That will be our first real prototype. Okay. And so these LED modules are being designed now. We should have them in a couple of weeks. And we will be able to start evaluating the prototypes. About the CSCW class, that will start at the beginning of November. So the students will be asked to design a table. So to find the shape of the table, to get the wood and put that on table legs. And at the beginning we wanted to put these LED modules in the tables. But we won't do it, for reasons of solidity of the table. You know, it's not very strong wood. So we will rather beam the LEDs on the table.
So we come back to a richer display. But we will restrict it to have the same display. LED lights that will be six centimetres apart. So we will simulate our And we have, yeah. And we have the feeling that to have the information embedded in the table, make it part of the table, it, I don't know, we will try, we don't really know what will happen with this prototype. We are very excited about that. Okay, maybe before going on with this subject we can talk a little bit about Huh. So based on the fact that most of the EPFL students have their own laptop, the similar idea is, okay, people who come should be able to very easily use some resources offered to them. So the idea is to help them socialise, organise their own work, not in a classroom setting but in a social place where they could gather. Different directions. Okay. So maybe we can focus a little bit more on what we have done about the noise-sensitive table. Tell you about what we're interested in. So Guillaume started by developing the application that gets the audio input. We wanted it to have a modular architecture, to be able to improve each layer separately. So the audio input is quite basic, no? We use sound cards to get the signals, and we have a threshold to detect when someone is talking, at least when the noise is bigger than a specific level. And then we get this information about people talking, not talking, and to say, okay, this one is talking for X time, or has been talking X percent of the last five minutes. Or two people are talking together. So this is a more semantic layer. We call that interaction models. And so when different conditions are detected, we trigger an interaction event. So a very simple one could be: one person is talking.
So it's the easy one. It's still in progress. Still in progress. So the last layer is about the visualisation. We are thinking about the visual grammar, meaning that when we have an interaction event that is fired, we will associate a way of visualising it. So for instance, if someone is talking, we can just have a light that will centre on the point that is close to him. That could be an instant feedback. And if he has been talking maybe seventy percent of the last five minutes, as me, it may be reflected as the intensity or size of a circle centred on the same point. And then if there's turn taking, we can show waves of light going from me to you, or from other people. Or using the centre part of the table to show something more about the more general dynamic. Or laughing. Yes. We can change, so we have eight by eight, two times, LEDs. Each LED is actually three LEDs. An RGB LED. And we can change the intensity. So it's already quite a rich feedback. Yes. And just a last thing. So this table is two different things. It's one object that will provide feedback to the end users, people collaborating around the table. But it's also a tool for us as researchers to test both the interaction events we want to detect and the visualisation. So we will be able to edit some interaction rules and the visualisation grammar. And we will be able to post-process some conversations just to see what happens, and to have these models and feedback working when people are using the table. So I think it was Yes. Maybe we could summarise the questions we were thinking about. So basically we want to know when A is talking, B is talking, and Basic information we need. Maybe we can go through the different questions and then you can organise your own? I would just add something.
We are not interested so much in sound level but in the level of the sound that corresponds to voice. Mm-mm. Okay, it was just a Okay. Go on. And I would add one more. It's about, because what we can record is sound origins. So, sonic points. But what we are interested in is people. So the question is, if someone is moving, I'm talking here, I stop talking, I'm going here, I'm talking: how to detect that it's the same person? If we want to integrate information to say this person has spoken seventy percent of the time in the last five minutes, obviously we need to know that it's the same person. So right now our practical solution is to have one microphone close to every place where people are supposed to sit. So we can infer that if a sound is coming from this area, it's the same person. So a kind of trick. But for the beginning it's enough. And a last question, that is related, our last if I don't find another one, About people, if someone is leaving or if someone, maybe it would be something quite complex even for you. If someone is staying here but is quiet, or leaving, how can we detect that someone is here but quiet? Okay. I know. But how to integrate the information? Not yet. Maybe later on. But Oh, the weight and the seat. Or putting his bag on the, okay. But it was just, it's a question. I'm not sure we will talk too much about that today. Not directly. But it's related. So in a couple of days we will get this FireWire box. We have chosen one that has eight line up. Line in, sorry. So we will be able to plug up to eight microphones with jack plugs or XLR, with the possibility of extending with eight more XLR microphones with another box. So we have the computer, we have the main box with the eight line ins. And it's FireWire. It is optic. To have eight more XLR if we need it.
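The interaction rule mentioned in this turn (a speaker's share of the last five minutes, shown as the intensity of a light) can be sketched as a sliding window over thresholded frames. This is only an illustration under assumptions: the class name `TalkShare`, the 0.1 s decision interval and the 0 to 255 intensity scale are hypothetical and not taken from the prototype.

```python
from collections import deque

class TalkShare:
    """Fraction of recent decision frames in which one seat was above threshold."""

    def __init__(self, window_s=300.0, frame_s=0.1):
        # keep only the decisions that fall inside the window (five minutes here)
        self.frames = deque(maxlen=int(round(window_s / frame_s)))

    def update(self, level, threshold):
        """Record one thresholded decision: talking if level exceeds threshold."""
        self.frames.append(level > threshold)

    def share(self):
        """Share of the window spent talking, between 0.0 and 1.0."""
        return sum(self.frames) / len(self.frames) if self.frames else 0.0

def led_intensity(share, max_level=255):
    """Map a talking share to an LED intensity, clamped to the valid range."""
    return int(round(max(0.0, min(1.0, share)) * max_level))

# Hypothetical usage: ten decisions, seven of them above the threshold.
seat = TalkShare()
for level in [0.9, 0.8, 0.1, 0.7, 0.9, 0.2, 0.8, 0.9, 0.1, 0.7]:
    seat.update(level, threshold=0.5)
print(seat.share())
```

Keeping the rule (the share computation) separate from the mapping to light is one way to keep the interaction layer and the visual grammar editable independently, as the modular architecture described here intends.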
But we are not sure to to buy it now. But we can extend it. That's uh Yes. Well We are not sure about what kind of microphone we should use. So we wanted to uh that's a first thing. So we wanted to have open possibilities. And secondly, maybe we will use this box, you know, to get uh data. I don't know if in another experiment, maybe in another project, if we want to have comments of different people around the table or o elsewhere. Uh getting uh having the possibility of using X_L_R_ microphone could be uh uh could be neat. So everything is open. No, we are not very interested in. More than one Mm-mm mm-mm. Okay. They won't have nice light uh lighting on in front of them. they would maybe careful. But Okay. So maybe w uh you could present a different work uh you have done. And uh I don't know how they are. So you told us it was different pieces, more than one integrated device. Oh uh What would be the d difference between detecting the average or detecting the maximum? What what would be the difference? What is the difference? It's a kind of filtering already? Sixty? Okay. again. Uh-huh. Okay. So it's thirty two millisecond frames taken each An idea. So something like that. So why are you overlapping? I have a rough idea. But uh what is the main interest? Mm-hmm. Yeah, and it's kind of crappy. Precise. Okay. Okay. Mm mm mm. So you use the first and last point to detect what is between. Mm-hmm. Mm-mm. Okay. So you are Mm-hmm. Mm-hmm. Corresponding to breath or a little noise. 'Kay. But you're talking about the microphone array here? Strong background noise. So phase Uh-huh. Uh-huh. Okay. So the phase the time between uh audio signals for um for this microphone on this microph Okay. Okay. Mm-hmm. Mm-hmm. And do you get something about the um distance of the guy that is talking, or just a direction? Okay. Mm-mm. Mm-hmm. Okay. You mean that each person will have a specific frequency because i the way he's talking or his localization? It's a kind of uh speaking styles? 
Both. I don't really get your last diagram, yes. So it's at one given moment still? Okay, so you can say that this frequency belongs more to this person and this one more to this one? But some frequencies maybe are used neither by guy one nor by guy two. Okay. Maybe they are using the same frequency. So you need to Okay. Okay. Okay. And can you detect if someone is laughing or angry? There is a kind of signature, even if we can't really trust it one hundred percent of the time. And do processing on it. Okay. Pitch. It is pitch. Okay. What is a? Excuse me? Uh-huh. Okay. Okay. So where do you represent the phase? Okay. Uh-huh. Uh-huh. Okay. Mm-hmm. Mm-hmm. Mm mm mm. Because if we are distributing the microphones all around the table, then we come back to the energy solution? Is that what you say? With the energy solution or with FFT too? Okay. But there are many, many things. The difference between the Mm-hmm. Mm mm yeah. What would be the difference? You will not compare each mic with all the other ones. But just comparing this one with the Mm okay. Okay. But what are, uh-uh. So what are these mics, for instance, the electret mic. Omni-directional? Electret? Okay. Yes. And when you were talking about comparing energy in the case where microphones would be distributed all around the table, are you comparing energy and the time delay? Because the time delay is too short to be compared, or can be neglected? Okay. Okay, okay. Yeah. It's time. Cool.
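For the set-up discussed in these turns (one microphone near each seat, with energy compared across microphones), a simple attribution rule is to pick the loudest microphone in each frame. The sketch below is a hedged illustration only: the seat labels, the sample values and the `min_energy` floor are hypothetical, and it ignores time delays between microphones.

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in samples) / len(samples)

def active_seat(frames_by_seat, min_energy):
    """Return the seat whose microphone is loudest, or None if all are quiet."""
    energies = {seat: frame_energy(x) for seat, x in frames_by_seat.items()}
    seat = max(energies, key=energies.get)
    return seat if energies[seat] >= min_energy else None

# Hypothetical frame: seat B is closest to the talker.
frame = {
    "A": [0.01, -0.02, 0.01, -0.01],
    "B": [0.40, -0.35, 0.38, -0.42],
    "C": [0.05, -0.04, 0.06, -0.05],
}
print(active_seat(frame, min_energy=0.01))
```

This rule cannot tell two simultaneous talkers apart and will mislabel a loud talker who leans towards a neighbour's microphone, which is the limitation of energy-based methods raised in the discussion.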
Okay. Mm-hmm. Okay. Okay. Mm-hmm. Okay. Okay. Yeah. Okay. Yeah. Okay. Mm-hmm. Okay. Mm-hmm. Okay.. Ok okay. Mm-hmm. Mm. So do you know uh which way you give the feedback? Light or I saw something like that. Okay. Hmm. Hmm. Mm-hmm. Mm. Okay. Okay. Yeah. Maybe more complex, yeah. Yeah. So th Hmm. But yeah. To the border basically some distance? Or Okay. O okay. Obviously, yeah. Okay. Okay. Hmm. Yeah. Yeah. Okay. Right. Hmm. Mm. Okay. Mm-hmm. Okay. But don't yeah. Don't you think with a beam also you have less problems with the sheets? Or you don't want to do that. Yeah. Okay. Confusing, yeah. Hmm. Okay. No, it's a choice. Yeah.. Okay. Mm-hmm. Yeah. Mm-hmm. Okay. Yep. Okay. Yeah. But physically it's just uh something flat on the table again, yeah. Okay. Okay. Yeah. Okay. Hmm. Okay. Yeah, so it's different, but still uh yeah, it's same Okay. Okay. Yeah. Okay. Yeah. Mm-hmm. Okay. Mm-hmm. Okay. Mm-hmm. Mm-hmm. Yeah, yeah. Hmm. Right. Hmm. Okay. Right. Mm-hmm. Mm-hmm. Or too long. Oh, just kidding. Okay. Yeah. Yeah. Right. Mm-hmm. Yeah. Ah, so you ha you you might have multiple colours? Or yeah. Sorry, didn't get that. Yeah. Yeah. Alright. Okay. Yeah.. Mm-hmm. Right. Mm-hmm. Okay. Mm-hmm. Okay. Hmm. No, it's quite clear. So uh I guess you want to know what you could do with an array maybe? Mm-hmm. Yeah. Or yeah. Yeah. Yeah, it's good idea. Just give some questions uh Mm-hmm. Right. Mm-hmm. Mm. Mm-hmm. Okay. Mm-hmm. Yeah. Mm-hmm. You mean the top of the wave-form? Okay. Yeah. Oh Okay. Hmm. Hmm. Mm-hmm. Okay. Yeah. Yeah. Hmm. Okay. Mm-hmm. Yeah. Yeah, yeah. Okay. Separation uh Yeah. Hmm. Mm-hmm. Well you need a camera. Obviously. I mean will you have a camera? Maybe you can detect breathing. Yeah. Yeah, that you could do. Or on the table uh there should be well. Okay uh Okay. Mm. Well Okay. Mm-hmm. Okay. So you I think Guillaume said that's optic, something like that? Or Okay. So you you need a different you need a microphone with um a device inside, right? 
Ho how does it work? Okay. Yeah. Okay. Mm. Hmm. It's another fire-wire or not fire-wire, but um Okay. But concretely you would use that to get higher quality signals? Y that's all.. Yeah. Ah okay, okay. Yeah. Yeah. Hmm. Okay. And how Hmm. Hmm. You mean you meant eight kilo uh sampling frequency? Yeah. Uh we use sixteen. So that's eight kilo band-width. Sixteen kilo uh sampling frequency. It it's not bad. Uh it's quite decent. Um Mm f before answering the questions uh th there's also a point of uh precision of localization if you use an array. So if you use higher frequencies you can get more precise localization. But maybe you don't need it. Uh that I can't answer directly. Uh it depends on your set-up, you know. Um you have to test basically. Um Y you get uh the angle. The most precise one is the asymmet in the horizontal plane. Then you also get elevation. But it's not very precise. It's more of an indication. And radius is very bad. If you use a planar geometry like this one. Yeah. Um you can use multiple ones and do some um um triangulation. Eight, eight. These are only screws. Microphones are on bottom. Yeah, yeah. Mm-hmm. A cube? Okay. Yeah, then you might get uh better direction in elevation. But I'm not sure it's really relevant. Um no. So Uh if people move forward, yeah. Ah, the radius? Th then the r the problem is not the geometry, it's the um I you need more than one basically. With one you will only get direction. Microphone array. So Mm. Direction, yeah. Um Yeah. It's a different option, yeah. Um we haven't tried that because we are not going to have the special kind of table. More uh this is not a good example but we have like a box, just bring the box and put it. So it uh we could not consider that. But in your case it could be interesting, yeah. Unless people put something on the microphone. But um Yeah. Right. But that I'm I think after the meeting we can go and talk to Olivier about that. I'm not the person for hardware. Yeah. 
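The remark that a single planar array gives only a direction, and that several arrays allow triangulation, can be illustrated by intersecting two bearing lines in the plane of the table. This is a geometric sketch only: the array positions, the angles and the function name `intersect_bearings` are hypothetical, and real bearings would carry measurement error.

```python
import math

def intersect_bearings(p1, theta1, p2, theta2):
    """Intersect two direction lines, each given by an origin and an angle in radians.

    Returns the (x, y) intersection, or None if the lines are parallel.
    """
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]  # cross product of the two directions
    if abs(denom) < 1e-12:
        return None
    # solve p1 + t * d1 = p2 + s * d2 for t
    t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

# Hypothetical layout: two arrays two metres apart, both hearing the same talker.
source = intersect_bearings((0.0, 0.0), math.radians(45), (2.0, 0.0), math.radians(135))
print(source)  # close to (1.0, 1.0)
```

When the two bearings are nearly parallel the denominator becomes small and the estimated position becomes very sensitive to angle errors, so the placement of the second array matters.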
Yeah, I'll t I'll try to present uh I guess it corresponds to your questions. Yeah. So the the lowest layer would be the first two questions. Um you asked about how to detect um activity. Currently you're thresholding the t maximum of your wave-form, right. Ah, ah. Okay. And you're looking for a peak. Yeah. Yeah. Yeah. Hmm. Okay. Uh Yeah. Yeah. And your second question uh was about F_F_T_, what can be done with F_F_T_. So um the approach we're taking here is to answer both questions the same time. We use F_F_T_ to do detection. We don't use the wave-form, the raw wave-form. So um We split um we split the signal in small frames, like sixteen millisecond. Sixteen. Um each frame is taking thirty two millisecond of signal, just to give an idea. And there is a one frame every sixteen millisecond, and they overlap of fifty percent. Um Yeah. Each sixteen millisecond. Well th these are details. It's just to give an idea. Um Well it's uh the the c Because usually when you do F_F_T_, y if I'm getting into details, but you you apply um a window to avoid effect. So the beginning and the end it it i it's not crappy, it's just not very much rep represented because you apply um a window which has this shape, having window. So they're very yeah. No, I'm saying uh when you take your signal you take one frame, you apply the window, uh and the window is simply uh coefficients you give. And you give higher coefficient to the middle than to the extreme. So if you don't do overlap y you will lose information at the beginning and the end. Okay. But you don't have to do overlap. I mean uh it's um Hmm. Um C_M_U_, no? Is it from C_M_U_? Um uh yeah. Yeah, yeah. Right. Yeah, yeah. Yeah. F_F_T_ is only a tool. Uh um uh what we do release a phase domain analysis. We only use the phase between the microphones. Um Yeah, but also for detection. Yeah. Um we have a way so um I don't think I should get into details. But basically the beginning is your signal which you slice into frame. 
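The framing described in this turn (thirty-two millisecond frames, one every sixteen milliseconds, each multiplied by a window that favours the middle) can be sketched as follows. The sketch assumes a 16 kHz sampling rate and a Hann window; a Hamming window would serve the same purpose, and the function names are illustrative.

```python
import math

def hann(n):
    """Hann window coefficients: zero at both ends, close to one in the middle."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_frames(signal, rate=16000, frame_ms=32, hop_ms=16):
    """Slice a signal into overlapping frames and apply the window to each."""
    size = rate * frame_ms // 1000   # 512 samples at 16 kHz
    hop = rate * hop_ms // 1000      # 256 samples, so frames overlap by 50 percent
    window = hann(size)
    return [
        [s * w for s, w in zip(signal[start:start + size], window)]
        for start in range(0, len(signal) - size + 1, hop)
    ]

# Hypothetical input: one second of a constant signal, to make the window visible.
frames = windowed_frames([1.0] * 16000)
print(len(frames), len(frames[0]))
```

Because each frame's edges are attenuated by the window, a sample that falls at the edge of one frame sits near the middle of the neighbouring frame, which is why the fifty percent overlap avoids losing information at frame boundaries.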
You do F_F_T_ on each mic. And the end is um a number of uh frequency bins which are used by each person, to uh explain roughly. S so when you speak, speech is wide band. So um the more active you are the more band-width you use. So you will get a large value of band-width for people who speak. And uh for the others it will be random. It will be a small value. Or just background noise. And so uh Yeah. 'Cause one problem if you use energy based methods, which is probably more uh the thing you've been using so far, is that it's not um quite related to location. So you can have um uh for one person you can have uh a high energy signal which is difficult to locate, and vice versa. Um I think in your case you're quite interested in the location. So I would advise to um use more uh phase domain methods. Um Once you have the F_F_T_, for each frequency you have uh the magnitude and the phase. So you can compare the phase of the microphones. And this is directly linked to the direction of the person. Um Sometimes we use it, but not it's not the first thing we use. I know it's a bit counterintuitive. But um maybe I'll go t to the board. Yeah. Okay. R right, for a single channel it's not very meaningful. But here it's relative phase between the microphones. So uh I'll just try to um s summarise. No, that's what I don't do. Um this would be valid for uh like speech recognition with a single channel. That would be a very good idea. But if you choose to use multiple channels um let's say you have four of them, with F_F_T_ you can just uh, as I said, l look at the phase between the microphones at a given frequency. Um it is linked to this value, yeah. Uh basically the time you're mentioning is this value. And this is the frequency. And this is an angle in radians. So uh F_F_T_ gives you a measure of these values. Uh this, I can point to a paper, uh I don't think this is the right place to explain everything. 
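What was drawn on the board is not in the transcript, but the description (a time, a frequency, and an angle in radians) matches the standard relation between a time delay tau and the relative phase at frequency f, namely phase = 2 * pi * f * tau. The sketch below is a minimal illustration under that assumption: two synthetic channels carry the same tone with a hypothetical two-sample delay, and the delay is recovered from the relative phase in one F_F_T_ bin. All names and values are chosen for the example.

```python
import numpy as np

fs = 16000                   # sampling frequency in Hz
n = 512                      # one 32 ms frame
true_delay = 2 / fs          # hypothetical delay between two microphones: 2 samples
f = 1000.0                   # frequency to inspect, in Hz (falls exactly on FFT bin 32)

t = np.arange(n) / fs
mic_a = np.sin(2 * np.pi * f * t)
mic_b = np.sin(2 * np.pi * f * (t - true_delay))   # same tone, arriving later

k = int(round(f * n / fs))   # FFT bin index for frequency f
A = np.fft.rfft(mic_a)[k]
B = np.fft.rfft(mic_b)[k]

phase_diff = np.angle(A * np.conj(B))              # relative phase in radians
estimated_delay = phase_diff / (2 * np.pi * f)     # invert phase = 2*pi*f*tau
```

The phase is only unambiguous while 2 * pi * f * tau stays within plus or minus pi, so for a given microphone spacing the highest frequencies can wrap around; the example keeps well inside that range.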
But will give you if you look at your table, uh what we have developed here is an approach where um th uh you divide the space in sectors. For example ten sectors around the table. And uh in each sector you will get the value. No. It's uh application dependent. For example um Well this is not. And on this one you could have a large value. On on on those ones, small values. Uh these are number of frequencies which we estimate with several steps from these measures. So to conclude what we are doing is uh we estimate how much of the frequency spectrum you are occupying when you speak, or I am occupying in this direction when I speak. And it turns out that um this is quite good to do uh detection and localization at the same time. Bec because you know in a sector space there is uh uh this much activity. Yeah, yeah. Um more recently um I've worked on uh um extending this with a more precise direction evaluation to know where in the sector the person is. So that uh that's not much work. Once you have done this, this is can be this can be done quite quickly. Then you Yeah, y well I I use a value. But then you'll get two large values. Yeah. So No. Just direction. Yeah. Well, f yeah, I it's not uh i it's an inherent problem to the geometry you use. You can for example, if you use another array, uh you can intersect lines of direction. Uh Mm maybe. Uh uh I w mm I wouldn't trust it too much. Classically, what you do is you extract the signal of the person. Um one interesting thing is that these numbers are not uh arbitrary things. They represent the number of frequencies where a given person is dominant. But when you have two of them, which might be interesting for separation, you know already when you have done this processing uh w which part of the frequencies of the spectrum um belongs to this person and to that person. And then it's easy to separate the signals. Um both. It's it's more an instantaneous um this is still instantaneous. This is for one time frame. 
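To make the sector idea concrete, here is a much simplified sketch. It is not the several-step estimation mentioned above: it assumes that a direction estimate in degrees is already available for every frequency bin of one time frame (a hypothetical input), and it simply counts how many bins fall into each of ten sectors around the table. The function name and the synthetic data are invented for the example.

```python
import numpy as np

def sector_activity(bin_azimuths_deg, n_sectors=10):
    """Count how many frequency bins point into each sector around the table."""
    width = 360.0 / n_sectors
    azimuths = np.asarray(bin_azimuths_deg, dtype=float) % 360.0
    sector_index = (azimuths // width).astype(int) % n_sectors
    return np.bincount(sector_index, minlength=n_sectors)

# Hypothetical frame: 120 bins dominated by a speaker near 40 degrees,
# plus 30 background-noise bins spread evenly around the table.
speech_bins = 40.0 + np.linspace(-3.0, 3.0, 120)
noise_bins = np.arange(30) * 12.0
counts = sector_activity(np.concatenate([speech_bins, noise_bins]))
active_sector = int(np.argmax(counts))
```

The sector containing the speaker gets a large count, the others keep small values from background noise, which mirrors the "large value here, small values there" picture in the discussion.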
If you look at all your frequencies for example uh zero to four kilohertz, um you can split it and say uh all these parts belong to person one here. And uh all the other parts belong to the other person. That's correct. But then uh you can look statistically it wi it will be negligible um because the r of uh of the difference level. Yeah. It's uh you can listen to it, yeah. And then you can do some uh maybe higher level um analysis where you get the pitch of the person or This Yeah, yeah. And this is Exactly. Yeah. Then it's uh random. And that's why you get these values uh which are random. Yeah. Yeah, but this is very rare. And when it happens, one is always masking the other in practice. So um Yeah. No no. But uh I've I've seen quite a few papers I I've n not done it myself. I'm really on the lowest level. Um but uh once you have done this, as I said, you can uh separate the signals and um do do processing. Uh so uh you can get pitch, rate of speech. That's quite easy. Pitch um the uh uh timbre en Francais. Um so uh for different person this might be quite different. At least for males and females. And uh rate of speech. Like if somebody is talking very uh in a very energetic manner, it might be quite fast. Uh or just energy also. One Yeah, also. Um that might be an issue if people are bringing laptops. Um They will b yeah. They might be detected as another another source. Uh so you would have to uh classify the sources as human or machine. But again, yeah, I think once you see the spectrum, if it's just pure and stable uh Also yeah, if if there's no pitch. So uh Yeah. Yeah. So all will go and be detected there. Yeah. So w what you're mentioning is a pain for us. But for you it might be 'cause we're only interested in getting the speech. But for you it might be quite uh quite good, yeah. So I've to avoid uh filtering and smoothing. 'Cause as soon as you do that um you exclude some of the information. So it's better to keep it for the last stage. 
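The frequency split described at the start of this turn (some parts of the spectrum assigned to person one, the rest to the other person) can be sketched as a binary mask on one frame's F_F_T_. This is a minimal illustration, assuming the per-bin ownership is already known from the earlier processing; here it is simply hard-coded, and two pure tones stand in for the two speakers. None of the names come from the speaker's software.

```python
import numpy as np

def separate_frame(spectrum, owner):
    """Split one frame's spectrum into per-person spectra with a binary mask.

    spectrum: complex FFT bins of one frame.
    owner:    integer per bin saying which person dominates it (hypothetical input).
    """
    return {int(p): np.where(owner == p, spectrum, 0) for p in np.unique(owner)}

n = 512
t = np.arange(n)
low = np.sin(2 * np.pi * 8 * t / n)      # stands in for person 1, energy in bin 8
high = np.sin(2 * np.pi * 40 * t / n)    # stands in for person 2, energy in bin 40
spectrum = np.fft.rfft(low + high)

owner = np.where(np.arange(spectrum.size) < 20, 1, 2)   # bins below 20 go to person 1
parts = separate_frame(spectrum, owner)
person1 = np.fft.irfft(parts[1], n)      # back to a wave-form you can listen to
person2 = np.fft.irfft(parts[2], n)
```

With real speech the two people occasionally share a bin, which is the rare overlap case discussed above; the mask then gives the bin entirely to whoever dominates it.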
So um um on top of this there's another part uh which is more linked to the tracking. All this was for one time frame. So um Yeah, you can play with it. Um in speech it's uh stationary over ten, twenty millisecond. And we make a stationary assumption, a local stationarity assumption um which allows you to use F_F_T_ and blah, blah, blah. Uh now in spite of that I know some people use much longer windows, which is not a problem. But Uh it's a bit too much I think. Uh like some very small words, like yes, they might be two hundred millisecond or three hundred millisecond. Um Yeah, they might be blurred with uh silence. So yes. You you might have to play with it just to save processing time, like use slightly longer frames. Yeah, at m at most, yeah. Um Simply because I've used one hundred as a minimum length of a speech word. Yeah, so. Um now assuming you have done this for each time, i if this is your detail or I should use a different Um if this is your azimuth at each time you get a direction. So azimuth is your direction in horizontal plane. It's an angle. Like north, south, east, west. And Yeah. No no, of this is uh really different. I'm just saying that uh once you have done this for a given time frame you can have a direction of the person, at least in terms of sector. And it's also possible to give a more precise direction quite quickly. So uh this was kind of the lowest layer. Now this is a layer just above it which might interest you. Um if you repeat that over time you will see patterns um for example uh at two different locations. Two different person will speak. So you can cluster those. And you will see that uh if it's long enough it's some significant event. Um this can be done quite cheaply, yeah. Yeah. Yeah. A cost. Yeah, th Um well I use Matlab. So it's not perfect. It's like three, four Yeah. But not everything is uh um simple linear. Especially this part. 
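The layer above the single time frame can be illustrated with a very simple stand-in for the clustering mentioned here: group consecutive frames whose direction stays close, and keep a group only if it lasts at least one hundred milliseconds, the minimum word length quoted above. This is a sketch under stated assumptions, not the tracking method itself: the input is a hypothetical list with one azimuth in degrees per sixteen-millisecond frame (`None` for silence), the twenty-degree tolerance is arbitrary, and wrap-around at 360 degrees is ignored.

```python
def speech_events(azimuths_deg, hop_ms=16, max_jump_deg=20, min_ms=100):
    """Group consecutive frames with similar azimuth; None marks a silent frame.

    Returns a list of (start_ms, duration_ms, mean_azimuth_deg) tuples.
    """
    events, start, last, run = [], None, None, []
    for i, az in enumerate(list(azimuths_deg) + [None]):   # sentinel flushes the last run
        same = az is not None and last is not None and abs(az - last) <= max_jump_deg
        if start is not None and not same:
            duration = (i - start) * hop_ms
            if duration >= min_ms:                          # long enough to be significant
                events.append((start * hop_ms, duration, sum(run) / len(run)))
            start = None
        if az is not None and start is None:
            start, run = i, []                              # a new run begins here
        if az is not None:
            run.append(az)
        last = az
    return events

# Hypothetical track: silence, 8 frames near 40 degrees, silence, 2 frames near 200 degrees.
track = [None] * 3 + [40, 42, 41, 39, 40, 41, 40, 42] + [None] * 2 + [200, 201] + [None]
events = speech_events(track)
```

The eight-frame run lasts 128 milliseconds and is kept; the two-frame run lasts 32 milliseconds and is dropped as too short to be a word.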
It's definitely not linear because you're looking at a maximum energy in a frequency, kind of. The most expensive part is here. With Matlab I have three, four times real time. Um you can do sub-optimal processing and uh for example here I'm considering all possible pairs of microphones, uh which is twenty eight, and the processing is directly um r proportional to that. So um you can save on that. But then you lose a little bit on precision. Also I I'm using all the small frames with fifty percent overlap. I don't think you really need absolutely to do that. It's not that good in terms of uh direction. Um Yeah, I would be careful with that. Like five, six is uh decent. Uh six I would say, yeah. Eight, yeah, uh I get down now to one, two degrees uh root mean square error. Uh so you might not need that. That's what I mean. Might not be Um Yeah. Yeah. Yeah. Yeah. Yeah. So it could be sufficient to uh, yeah, define enough sectors and No. No. So it's arbitrary. I use twenty degree sectors. Uh 'cause I had to choose a value. But it's up to you. Um Yeah. Yeah. Um So that's another way, yeah. It it's almost like having a lapel, yeah. Um Yeah. I think it's not necessarily a bad idea, actually. Um Well uh If you are able to calibrate your microphones um the ones who are closer to the person will get more energy. Um so you can compare that. It's possible. So Yeah. Um I have experience with that. Um on the side I've done some single channel work which can maybe help. Um 'cause uh it seems to me that you are not definite on the geometry, right? Yeah. Um m this part is quite specific to a microphone array where like they are concentrated in uh some place, like in the middle of the table. Yeah. Yeah. Yeah. But you can kind of come back to that also. Um Uh F_F_T_ also. Yeah, yeah. R right. Yeah, yeah. It has to be as good as possible. Yeah. Yeah. True. Yeah. We can also consider directional microphones. You know no. 
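The figure of twenty-eight pairs follows from eight microphones taken two at a time, and since the processing cost is said to be proportional to the number of pairs, it is easy to see what dropping microphones would save. The short sketch below only does this counting; the helper name is invented, and the sector count assumes the twenty-degree sectors quoted above cover the full circle.

```python
from itertools import combinations

def mic_pairs(n_mics):
    """All unordered pairs of microphone indices."""
    return list(combinations(range(n_mics), 2))

pairs_8 = mic_pairs(8)        # the full array discussed here
pairs_6 = mic_pairs(6)        # a cheaper option mentioned as still decent
relative_cost = len(pairs_6) / len(pairs_8)   # cost scales with the number of pairs
n_sectors = 360 // 20         # twenty-degree sectors around the table
```

Going from eight to six microphones roughly halves the pairwise work, at the price of some precision as noted in the discussion.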
Um so you can uh their pattern is uh not uh the same uh depending on the direction where the person is. So for example if there was a directional microphone pointing there you would get most of my energy but less from you, et cetera. You can also do microphone arrays with directional microphones. Uh um might not be relevant. But um Uh More or less. It's it depends on what you want. I mean uh I don't know. Yeah. They're o yeah, omni-directional, yeah. M uh I guess so. It's the same, yeah. Exactly the same, yeah. Yeah. Yeah, yeah. Quite expensive. Yeah, they are um high quality mics. Uh I think we can leave this question for Olivier. Yeah. No, no. Y you would uh assume Uh you would neglect it. The time delay is is very small. It it's only used for uh getting the a direction when you have a microphone array. But in that case it's not relevant. Um you would assume that they are roughly synchronised, not necessarily very precisely, and um compare the energy levels. But I have no experience with that. Um Now I don't know, you don't want to use ah, it's almost okay. Yes.
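The alternative suggested at the end, with microphones spread around the table, roughly synchronised and calibrated, and compared on energy rather than time delay, can be sketched as follows. The speaker states having no experience with this, so this is only an illustration of the idea: the function name, the optional calibration gains, and the synthetic amplitudes are all assumptions made for the example.

```python
import numpy as np

def loudest_mic(frames, gains=None):
    """Compare per-microphone energy for one time frame.

    frames: array of shape (n_mics, n_samples), roughly time-aligned.
    gains:  optional per-microphone calibration factors (hypothetical).
    Returns (index of the loudest microphone, energies in dB).
    """
    frames = np.asarray(frames, dtype=float)
    if gains is not None:
        frames = frames * np.asarray(gains, dtype=float)[:, None]
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return int(np.argmax(energy_db)), energy_db

n = 512
tone = np.sin(2 * np.pi * 32 * np.arange(n) / n)
# Hypothetical frame: the same voice reaches three microphones at decreasing level.
frames = np.stack([1.0 * tone, 0.5 * tone, 0.25 * tone])
nearest, energy_db = loudest_mic(frames)
level_drop = energy_db[0] - energy_db[1]   # halving the amplitude costs about 6 dB
```

The small time delays between microphones are neglected here, as suggested above; only the level differences are used to decide which microphone the person is closest to.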
